Skip to content

HADOOP-19857. Enable unit tests on GHA#8452

Open
pan3793 wants to merge 42 commits intoapache:trunkfrom
pan3793:gha-enable-tests
Open

HADOOP-19857. Enable unit tests on GHA#8452
pan3793 wants to merge 42 commits intoapache:trunkfrom
pan3793:gha-enable-tests

Conversation

@pan3793
Copy link
Copy Markdown
Member

@pan3793 pan3793 commented Apr 22, 2026

Description of PR

This PR enables running unit tests in parallel on GitHub Actions, with ~150 test suites(classes) excluded due to flaky or consistently failing on GHA that need to be fixed later, all of the remaining test suites take ~2.5 hours to complete. As a comparison, Jenkins takes ~30 hours.

How was this patch tested?

GHA report.

For code changes:

  • Does the title or this PR starts with the corresponding JIRA issue id (HADOOP-19857)?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

AI Tooling

No AI usage.

Comment thread .github/workflows/tmpl_build_and_test.yml Outdated
options: --user ${{ needs.build-image.outputs.uid }}
strategy:
fail-fast: false
max-parallel: 6
Copy link
Copy Markdown
Member Author

@pan3793 pan3793 Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, this runs in the contributor's forked repo, it's free but has concurrency limitations, also remember that each split requires a build, larger parallel splits consume more overall time.

include:
- comment: hdfs
modules:
-pl :hadoop-hdfs
Copy link
Copy Markdown
Member Author

@pan3793 pan3793 Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, this single module takes more than 4 hours to complete (even though more than 20 tests are excluded for now), we need to improve the slow tests and use tags to split it into more groups

Copy link
Copy Markdown
Member Author

@pan3793 pan3793 Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, it actually takes 5 hours to complete (with ~30 tests excluded).

I split it into 2 groups:

  • slow, ~70 classes, which single test class takes more than 60s, all of them take ~150min
  • other, the rest of them

hope it can be completed in 2.5 hours next round

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajfabbri, given the situation, I expect we may

  • exclude fewer than 200 test classes
  • keep others running stable on GHA, within 3 hours

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not ready but almost here, with ~160 test suites (classes) excluded, all remaining tests run successfully in 2.5h

https://github.com/pan3793/hadoop/actions/runs/24870046901

I still need to re-run several rounds to ensure the initial test list runs stably.

@ajfabbri @steveloughran @slfan1989 would be great if you could take a look first.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pan3793 Will look at it when I get back. I am on vacation until tomorrow night.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine. We have some work to do on improving tests: We need to make them faster, and completely eliminate flaky tests (ideally). This gives us an iterative way to tackle it. 👍

@pan3793 pan3793 force-pushed the gha-enable-tests branch 4 times, most recently from 8a5ffe4 to 1dcdec9 Compare April 23, 2026 02:34
@apache apache deleted a comment from hadoop-yetus Apr 23, 2026
@apache apache deleted a comment from hadoop-yetus Apr 23, 2026
@apache apache deleted a comment from hadoop-yetus Apr 23, 2026
@apache apache deleted a comment from hadoop-yetus Apr 23, 2026
@apache apache deleted a comment from hadoop-yetus Apr 23, 2026
@pan3793 pan3793 force-pushed the gha-enable-tests branch 3 times, most recently from 1a874a0 to 5ee44de Compare April 24, 2026 11:40
@apache apache deleted a comment from hadoop-yetus Apr 24, 2026
@apache apache deleted a comment from hadoop-yetus Apr 24, 2026
@apache apache deleted a comment from hadoop-yetus Apr 24, 2026
@apache apache deleted a comment from hadoop-yetus Apr 24, 2026
@apache apache deleted a comment from hadoop-yetus Apr 24, 2026
@apache apache deleted a comment from hadoop-yetus Apr 24, 2026
Copy link
Copy Markdown
Contributor

@ajfabbri ajfabbri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made an initial review of the current Draft. I had a couple of questions, but it looks good overall.

#

# hadoop-common
**/org/apache/hadoop/ipc/TestRPC.java
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which of these are just slow, and which tests are flaky / failures? Would be good to leave comments here and separate into sections, e.g.

# hadoop-hdfs
...
# slow
...
# flaky

for each project.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this list does not contain slow tests, only flaky or consistently failing cases

TBH, it will take a lot of effort to classify that, but it might not be worth doing. Contributors who want to improve it are easy to classify by running the test locally a few times, fixing a flaky or consistently failing test does not have much difference IMO

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Please add a comment at the top of the file, e.g. "Flaky and/or broken tests needing attention.". The intention is that we will fix or remove these over time. Slow tests are handled with the annotation. 👍

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the README.md under the same folder to explain that

include:
- comment: hdfs
modules:
-pl :hadoop-hdfs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine. We have some work to do on improving tests: We need to make them faster, and completely eliminate flaky tests (ideally). This gives us an iterative way to tackle it. 👍

package org.apache.hadoop.oncrpc;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be a separate commit. When you are ready to merge, we can do an interactive rebase to squash all commits except this one, and add a descriptive log message to this commit.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will split this PR into independent PRs, each of which focuses on one topic.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way is fine with me.

}

@test "hadoop_stop_daemon_force_kill" {
skip "Skip on GitHub Actions now"
Copy link
Copy Markdown
Contributor

@ajfabbri ajfabbri Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we disabling this? Does this affect Yetus CI? We could do a conditional skip based on an environment variable which is only set for GH actions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, need to change the condition to skip only on GHA.

# Verify that host directories get mounted without z option
# and INFO messages get printed out
@test "start-build-env.sh (Docker without z mount option)" {
skip "Skip on GitHub Actions now"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.


# Verify that host directories get mounted with z option
@test "start-build-env.sh (Docker with z mount option)" {
skip "Skip on GitHub Actions now"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here

<configuration>
<forkCount>0</forkCount>
</configuration>
</plugin>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I agree we should fix the underlying issue. How can we make sure we catch "silent VM crashes" on jenkins?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is ok if we don't have an answer, but should we start a mailing list thread, or file a JIRA? Just don't want to accidentally, silently, break a test.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change last year, can't recall the details ... will revert it for now.

@github-actions github-actions Bot removed the YARN label May 8, 2026
@pan3793 pan3793 changed the title [DO-NOT-MERGE] Enable tests on GHA HADOOP-19857. Enable unit tests on GHA May 8, 2026
@pan3793 pan3793 marked this pull request as ready for review May 8, 2026 13:02
@pan3793
Copy link
Copy Markdown
Member Author

pan3793 commented May 9, 2026

@apache apache deleted a comment from hadoop-yetus May 11, 2026
@apache apache deleted a comment from hadoop-yetus May 11, 2026
@apache apache deleted a comment from hadoop-yetus May 11, 2026
@apache apache deleted a comment from hadoop-yetus May 11, 2026
@apache apache deleted a comment from hadoop-yetus May 11, 2026
@apache apache deleted a comment from hadoop-yetus May 11, 2026
@apache apache deleted a comment from hadoop-yetus May 11, 2026
@apache apache deleted a comment from hadoop-yetus May 11, 2026
@apache apache deleted a comment from hadoop-yetus May 11, 2026
@slfan1989
Copy link
Copy Markdown
Contributor

@pan3793 Thank you, Pan, for your work on this. We’ve already discussed it offline, and I approve this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants